Highly parallel algorithms often generate large volumes of data containing both valid and invalid elements, and high-performance solutions to the stream compaction problem are extremely important in such scenarios. Although parallel stream compaction has been extensively studied on GPU-based platforms and, more recently, on the Intel Xeon Phi platform, no study has yet considered its parallelization on a low-cost computing cluster, even though general-purpose single-board computing devices are gaining popularity in the scientific community due to their high performance per dollar and per watt. In this work, we consider an extremely low-cost cluster composed of four Odroid C2 single-board computers (SDCs), and show that stream compaction can also benefit from this kind of platform, with important speedups obtained. To do so, we derive two parallel implementations of the stream compaction problem using MPI. We then evaluate them with varying numbers of processes and/or SDCs, as well as different input sizes. In general, unless the number of elements in the stream is too small, the best results are obtained when eight MPI processes are distributed among the four SDCs that make up the cluster. To add value to these results, we also run the two parallel implementations on a very high-performance but power-hungry 18-core Intel Xeon E5-2695 v4 multicore processor, finding that the Odroid C2 SDC cluster constitutes a much more efficient alternative when both execution time and required energy are taken into account. Finally, we also implement and evaluate a parallel version of the stream split problem, which additionally stores the invalid elements after the valid ones. Our implementation shows good scalability on the Odroid C2 SDC cluster and a more balanced computation/communication ratio than the stream compaction problem.
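For readers unfamiliar with the two problems, the following is a minimal sequential sketch in Python of what stream compaction and stream split compute. It is not the MPI implementation evaluated in this work. The function names `compact_chunked` and `split`, the `is_valid` predicate, and the chunk-plus-prefix-sum decomposition (a common way to parallelize compaction, emulated here with a plain loop over hypothetical workers) are all assumptions made for illustration, as is the choice to preserve the relative order of elements.

```python
from itertools import accumulate


def compact_chunked(stream, is_valid, num_chunks):
    """Stream compaction: keep only the valid elements, preserving their order.

    Emulates a chunked decomposition: each hypothetical worker compacts its
    own contiguous chunk, and an exclusive prefix sum of the per-chunk counts
    tells each worker where to write in the output.
    """
    n = len(stream)
    # Contiguous chunk boundaries, one chunk per hypothetical worker.
    bounds = [n * i // num_chunks for i in range(num_chunks + 1)]
    # Step 1: local compaction of each chunk (independent work).
    local = [
        [x for x in stream[bounds[i]:bounds[i + 1]] if is_valid(x)]
        for i in range(num_chunks)
    ]
    # Step 2: exclusive prefix sum of the local counts gives write offsets.
    counts = [len(part) for part in local]
    offsets = [0] + list(accumulate(counts))[:-1]
    # Step 3: each chunk's valid elements are written at its offset.
    out = [None] * sum(counts)
    for off, part in zip(offsets, local):
        out[off:off + len(part)] = part
    return out


def split(stream, is_valid):
    """Stream split: valid elements first, then the invalid ones.

    Assumption: relative order is preserved within each group.
    """
    valid = [x for x in stream if is_valid(x)]
    invalid = [x for x in stream if not is_valid(x)]
    return valid + invalid


if __name__ == "__main__":
    # Hypothetical input where zero marks an invalid element.
    data = [3, 0, 7, 0, 0, 5, 1, 0]
    print(compact_chunked(data, lambda x: x != 0, num_chunks=4))
    print(split(data, lambda x: x != 0))
```

In a distributed setting, step 2 is where processes would need to exchange their counts, and step 3 is where the compacted chunks would be moved to their final positions, so these two steps are where communication cost would appear in this particular decomposition.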